asymptotic convergence-rate
The Asymptotic Convergence-Rate of Q-learning
In this paper we show that for discounted MDPs with discount factor, 1/2 the asymptotic rate of convergence of Q-Iearning if R(1 -,) 1/2 and O( Jlog log tit) otherwise is O(1/tR (1-1') provided that the state-action pairs are sampled from a fixed prob(cid:173) ability distribution. Here R Pmin/Pmax is the ratio of the min(cid:173) imum and maximum state-action occupation frequencies. The re(cid:173) sults extend to convergent on-line learning provided that Pmin 0, where Pmin and Pmax now become the minimum and maximum state-action occupation frequencies corresponding to the station(cid:173) ary distribution.
The Asymptotic Convergence-Rate of Q-learning
Q-Iearning is a popular reinforcement learning (RL) algorithm whose convergence is well demonstrated in the literature (Jaakkola et al., 1994; Tsitsiklis, 1994; Littman and Szepesvari, 1996; Szepesvari and Littman, 1996). Our aim in this paper is to provide an upper bound for the convergence rate of (lookup-table based) Q-Iearning algorithms. Although, this upper bound is not strict, computer experiments (to be presented elsewhere) and the form of the lemma underlying the proof indicate that the obtained upper bound can be made strict by a slightly more complicated definition for R. Our results extend to learning on aggregated states (see (Singh et al., 1995» and other related algorithms which admit a certain form of asynchronous stochastic approximation (see (Szepesv iri and Littman, 1996». Present address: Associative Computing, Inc., Budapest, Konkoly Thege M. u. 29-33, HUNGARY-1121 The Asymptotic Convergence-Rate of Q-leaming
The Asymptotic Convergence-Rate of Q-learning
Q-Iearning is a popular reinforcement learning (RL) algorithm whose convergence is well demonstrated in the literature (Jaakkola et al., 1994; Tsitsiklis, 1994; Littman and Szepesvari, 1996; Szepesvari and Littman, 1996). Our aim in this paper is to provide an upper bound for the convergence rate of (lookup-table based) Q-Iearning algorithms. Although, this upper bound is not strict, computer experiments (to be presented elsewhere) and the form of the lemma underlying the proof indicate that the obtained upper bound can be made strict by a slightly more complicated definition for R. Our results extend to learning on aggregated states (see (Singh et al., 1995» and other related algorithms which admit a certain form of asynchronous stochastic approximation (see (Szepesv iri and Littman, 1996». Present address: Associative Computing, Inc., Budapest, Konkoly Thege M. u. 29-33, HUNGARY-1121 The Asymptotic Convergence-Rate of Q-leaming